In this lesson we will learn how to explore tabular data, perform basic sanity checks and do beautiful visualizations with ggplot2.

Prerequsites

First we will install some R packages containing functions we need. This can be done with install.packages() function.

install.packages("visdat")
install.packages("tidyverse")
install.packages("plotly")
install.packages("ggplot2")

Once the packages are installed, we can load them in the current R session with library() function.

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Now our work space is ready, the only thing missing - is the actual data set that we want to explore and visualize. For this lesson we will use a slightly simplified version of a data set published by Burghard et al 2015.

Read the data

The simplified version of the data set can be found in the course_data folder (burghardt_et_al_2015_expt1.csv). Run the command below to read the data. We will store it in a variable called x.

x <- read.csv("../course_data/burghardt_et_al_2015_expt1.csv") %>%  as.tibble()

Look at the data

To get a glimpse of the data, we can just type in the console the name of the variable we have stored it in (x).

x
## # A tibble: 957 x 15
##    genotype background temperature fluctuation day.length vernalization
##      <fctr>     <fctr>       <int>      <fctr>      <int>        <fctr>
##  1  Col Ama        Col          12         Con         16            NV
##  2  Col Ama        Col          12         Con         16            NV
##  3  Col Ama        Col          12         Con         16            NV
##  4  Col Ama        Col          12         Con         16            NV
##  5  Col Ama        Col          12         Con         16            NV
##  6  Col Ama        Col          12         Con         16            NV
##  7  Col Ama        Col          12         Con         16            NV
##  8  Col Ama        Col          12         Con         16            NV
##  9  Col Ama        Col          12         Con          8            NV
## 10  Col Ama        Col          12         Con          8            NV
## # ... with 947 more rows, and 9 more variables: survival.bolt <fctr>,
## #   bolt <fctr>, days.to.bolt <int>, days.to.flower <int>,
## #   rosette.leaf.num <int>, cauline.leaf.num <int>, blade.length.mm <dbl>,
## #   total.leaf.length.mm <dbl>, blade.ratio <dbl>

This will show you first 10 lines and 6 columns of the data.

Challenge: How many rows and columns does our data set contain?

There are other ways to visually inspect your data:

View(x)
str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame':    957 obs. of  15 variables:
##  $ genotype            : Factor w/ 10 levels "Col Ama","Col FRI",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ background          : Factor w/ 3 levels "Col","Col FRI",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ temperature         : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ fluctuation         : Factor w/ 2 levels "Con","Var": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day.length          : int  16 16 16 16 16 16 16 16 8 8 ...
##  $ vernalization       : Factor w/ 2 levels "NV","V": 1 1 1 1 1 1 1 1 1 1 ...
##  $ survival.bolt       : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ bolt                : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ days.to.bolt        : int  28 29 31 31 32 33 34 35 69 72 ...
##  $ days.to.flower      : int  43 44 43 42 44 47 47 49 90 91 ...
##  $ rosette.leaf.num    : int  18 15 13 17 19 14 15 18 53 49 ...
##  $ cauline.leaf.num    : int  6 5 4 5 4 4 3 5 6 5 ...
##  $ blade.length.mm     : num  12.9 10.5 13.2 14.6 13.3 14.7 13 17.8 20.7 19.4 ...
##  $ total.leaf.length.mm: num  21.1 19.1 23.4 27.2 20.4 25.3 23.2 31.3 33.1 33.2 ...
##  $ blade.ratio         : num  0.611 0.55 0.564 0.537 0.652 ...

Challenge what types of variables do we have in our data set? what could they mean? What type of variable is “bolt”? How about “temperature”?

dim(x)
## [1] 957  15
summary(x)
##        genotype     background   temperature    fluctuation
##  flc-3 FRI :136   Col    :315   Min.   :12.00   Con:504    
##  Col Ama   :135   Col FRI:510   1st Qu.:12.00   Var:453    
##  Col FRI   :128   Ler    :132   Median :12.00              
##  prmt5 FRI :125                 Mean   :16.98              
##  vin3-4 FRI:121                 3rd Qu.:22.00              
##  Ler-1     : 68                 Max.   :22.00              
##  (Other)   :244                                            
##    day.length    vernalization survival.bolt bolt     days.to.bolt   
##  Min.   : 8.00   NV:627        Y:957         N: 59   Min.   : 15.00  
##  1st Qu.: 8.00   V :330                      Y:898   1st Qu.: 38.00  
##  Median :16.00                                       Median : 57.00  
##  Mean   :12.01                                       Mean   : 66.04  
##  3rd Qu.:16.00                                       3rd Qu.: 85.00  
##  Max.   :16.00                                       Max.   :162.00  
##                                                                      
##  days.to.flower   rosette.leaf.num cauline.leaf.num blade.length.mm
##  Min.   : 21.00   Min.   :  5.00   Min.   : 1.000   Min.   : 7.10  
##  1st Qu.: 46.00   1st Qu.: 24.00   1st Qu.: 5.000   1st Qu.:18.00  
##  Median : 66.00   Median : 40.00   Median : 8.000   Median :20.95  
##  Mean   : 71.59   Mean   : 39.71   Mean   : 7.208   Mean   :21.11  
##  3rd Qu.: 92.00   3rd Qu.: 53.00   3rd Qu.: 9.000   3rd Qu.:24.30  
##  Max.   :182.00   Max.   :112.00   Max.   :17.000   Max.   :59.00  
##  NA's   :83       NA's   :95       NA's   :96       NA's   :327    
##  total.leaf.length.mm  blade.ratio    
##  Min.   : 9.00        Min.   :0.0000  
##  1st Qu.:29.10        1st Qu.:0.5564  
##  Median :34.60        Median :0.5948  
##  Mean   :34.69        Mean   :0.5874  
##  3rd Qu.:40.27        3rd Qu.:0.6342  
##  Max.   :66.30        Max.   :6.5556  
##  NA's   :303          NA's   :304

So far, we have already used a handful of R functions, though we have just barely started:

Of course, it is difficult to memorize all the function names, what they are doing and how you should use them. Luckily, R has very convenient built-in help. To use it, type name of a function or any other object you are interested it preceded by ?

?summary

R help might seem cryptic at first, but you will get used to it, you can always scroll down to the ‘examples’ section and try running some of them yourself to get an idea of what the function in question is capable of.

Challenge what does round() function do?

Challenge can you specifically look at the end of your data set instead of it’s beginning? How would you do it in R? (hint: ?tail)

Visual representation of your data

To get a bird-eye view at the data, to identify its structure and potential problems we can simply plot it in Rothko style using vis_dat() function.

vis_dat(x)

Challenge what is the most common data type in our data set ? Can you spot any potential problems?

Missing values

You might have noticed that some of the cells are plotted in gray - these indicate missing values. Missing values can occur in a data set when a certain observation was not collected and cause potential problems in the downstream analysis if we are not careful. There are a couple of strategies how to deal with missing values:

  • remove rows with missing values completely (the safest option, though can result in substantial data loss);
  • ignore missing values when you can (can you?);
  • impute value based on surrounding values (the most risky).

To stay on the safe side now, we will simply remove all the rows with missing values from our beloved x data set.

x <- x %>% drop_na()

Challenge How many rows are left in the data set after we have dropped missing values?

Plots! Plots! Plots!

Now, once we have learned some basics of our data set, we will go straightly to plotting to get even more insights. We will use plotting grammars from ggpplot2 package. Plotting grammars might sound scary, but just think about them as simple building blocks of a plot. By combining and layering several blocks we can create our dream plot for a dream paper or just a lab meeting.

To build a graph we need several blocks:

Let’s focus on the first three: data, aesthetics and geometric object.

Everyone, except excel, likes box plots, so we will start by plotting days. to.flower variable measured for different genotypes. ggplot() initializes a plot, we also give it our x data set and specify aesthetics that should be common for the whole plot, in our case - names for x and y axis.

ggplot(x, aes(genotype, days.to.flower))

Next, we can layer geom_object on top of the created canvas by just adding geom_boxplot().

ggplot(x, aes(genotype, days.to.flower)) +
    geom_boxplot()

Exercise: can you make a violin plot instead? (hint: ?geom_violin)

ggplot(x, aes(genotype, days.to.flower)) +
  geom_violin()

Let’s be honest, the plot looks pretty ugly. We want to add some colour. Colour sound like it should belong to aesthetics, so we put it inside aes() function. Note, you usually don’t need to worry about colours, ggplot does it for you by sampling from its default colour palettes. In this case we want different genotypes to be coloured differently.

ggplot(x, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) + geom_point()

Pretty! Let’s now layer a couple of geom_objects on the same plot. Say, we want to have points for actual values overlayed with box plots:

ggplot(x, aes(genotype, rosette.leaf.num, colour = genotype)) +
    geom_jitter() +
  geom_boxplot(alpha = 0.2)

Note, in this case we have supplied geom_boxplot with additional aesthetics parameter alpha, and only boxes got affected, not the whole plot. This is an example how you could control specific aesthetics of individual geom_objects.

Now, let’s assume we are particularly interested in the relationship between number of rosette leafs and blade length in mm per genotype. To visualize the relationship we generate a scatter plot.

ggplot(x, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) +
    geom_point()

It is pretty, though messy. What if we wanted to isolate each genotype in individual plots? It is easy with ggplot2, we need another layer in our plot, facet, which tells R to plot genotypes separetely and arrange them in a ribbon, side by side.

ggplot(x, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) +
    geom_point() +
    facet_wrap(~genotype)

So much better. But even this is not the limit. We can easily turn our plot in interactive mode, by just wrapping in a special ggplotly() function.

p1 <- ggplot(x, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) + 
  geom_point() +
  facet_wrap(~genotype)
plotly::ggplotly(p1)